Using a Semantic Concordance for Sense Identification
Abstract
This paper proposes benchmarks for systems of automatic sense identification. A textual corpus in which open-class words had been tagged both syntactically and semantically was used to explore three statistical strategies for sense identification: a guessing heuristic, a most-frequent heuristic, and a co-occurrence heuristic. When no information about sense frequencies was available, the guessing heuristic using the numbers of alternative senses in WordNet was correct 45% of the time. When statistics for sense frequencies were derived from a semantic concordance, the assumption that each word is used in its most frequently occurring sense was correct 69% of the time; when that figure was calculated for polysemous words alone, it dropped to 58%. And when a co-occurrence heuristic took advantage of prior occurrences of words together in the same sentences, little improvement was observed. The semantic concordance is still too small to estimate the potential limits of a co-occurrence heuristic.

1. INTRODUCTION

It is generally recognized that systems for automatic sense identification should be evaluated against a null hypothesis. Gale, Church, and Yarowsky [1] suggest that the appropriate basis for comparison would be a system that assumes that each word is being used in its most frequently occurring sense. They review the literature on how well word-disambiguation programs perform; as a lower bound, they estimate that the most frequent sense of polysemous words would be correct 75% of the time, and they propose that any sense-identification system that does not give the correct sense of polysemous words more than 75% of the time would not be worth serious consideration. The value of setting such a lower bound is obvious. However, Gale, Church, and Yarowsky [1] do not make clear how they determined what the most frequently occurring senses are.
In the absence of such information, a case can be made that the lower bound should be given by the proportion of monosemous words in the textual corpus. Although most words in a dictionary have only a single sense, it is the polysemous words that occur most frequently in speech and writing. This is true even when we ignore the small set of highly polysemous closed-class words (pronouns, prepositions, auxiliary verbs, etc.) that play such an important structural role. For example, 82.3% of the open-class words in WordNet [2] are monosemous, but only 27.2% of the open-class words in a sample of 103 passages from the Brown Corpus [3] were monosemous. That is to say, 27% of the time no decision would be needed, but for the remaining 73% of the open-class words, the response would have to be "don't know." This is probably the lowest lower bound anyone would propose, although if the highly polysemous, very frequently used closed-class words were included, it would be even lower. A better performance figure would result, of course, if, instead of responding "don't know," the system were to guess. What is the percentage correct that you could expect to obtain by guessing?

* Hunter College and Graduate School of the City University of New York

2. THE GUESSING HEURISTIC

A guessing strategy presumes the existence of a standard list of words and their senses, but it does not assume any knowledge of the relative frequencies of different senses of polysemous words. We adopted the lexical database WordNet [2] as a convenient online list of open-class words and their senses. Whenever a word is encountered that has more than one sense in WordNet, a system with no other information could do no better than to select a sense at random. The guessing heuristic that we evaluated was defined as follows: on encountering a noun (other than a proper noun), verb, adjective, or adverb in the test material, look it up in WordNet.
If the word is monosemous (has a single sense in WordNet), assign that sense to it. If the word is polysemous (has more than one sense in WordNet), choose a sense at random with a probability of 1/n, where n is the number of different senses of that word. This guessing heuristic was then used with the sample of 103 passages from the Brown Corpus. Given the distribution of open-class words in those passages and the number of senses of each word in WordNet, estimating the probability of a correct sense identification is a straightforward calculation. The result was that 45.0% of the 101,284 guesses would be correct. When the percent correct was calculated for just the 76,067 polysemous word tokens, it was 26.8%.

3. THE MOST FREQUENT HEURISTIC

Data on sense frequencies do exist. During the 1930s, Lorge [4] hired students at Columbia University to count how often each of the senses in the Oxford English Dictionary occurred in some 4,500,000 running words of prose taken from magazines of the day. These and other word counts were used by Thorndike in writing the Thorndike-Barnhart Junior Dictionary [5], a dictionary for children that first appeared in 1935 and that was widely used in the public schools for many years. Not only was Thorndike able to limit his dictionary to words in common use, but he was also able to list senses in the order of their frequency, thus insuring that the senses
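Returning to the guessing heuristic of Section 2, the "straightforward calculation" can be sketched as follows. This is a minimal illustration, not the authors' program: the function name and the toy sense counts are hypothetical, and the final check assumes that every token not counted among the 76,067 polysemous tokens is monosemous and therefore always guessed correctly.

```python
from fractions import Fraction


def expected_guess_accuracy(sense_counts):
    """Expected fraction of correct guesses under uniform random choice.

    sense_counts holds one integer per word token: the number of senses n
    that the token's word has in the lexicon. A token with n senses is
    guessed correctly with probability 1/n (so 1 for monosemous words).
    """
    if not sense_counts:
        raise ValueError("need at least one token")
    total = sum(Fraction(1, n) for n in sense_counts)
    return float(total / len(sense_counts))


# Hypothetical toy passage: five tokens with 1, 2, 4, 1 and 5 senses.
toy_tokens = [1, 2, 4, 1, 5]
toy_accuracy = expected_guess_accuracy(toy_tokens)

# Consistency check on the figures reported in the text.
total_tokens = 101_284
polysemous_tokens = 76_067
polysemous_accuracy = 0.268
# Assumption: all remaining tokens are monosemous and always correct.
monosemous_tokens = total_tokens - polysemous_tokens
overall_accuracy = (
    monosemous_tokens + polysemous_accuracy * polysemous_tokens
) / total_tokens

print(f"toy passage: {toy_accuracy:.3f}")
print(f"overall from reported figures: {overall_accuracy:.3f}")
```

Under that assumption, the reported 26.8% on polysemous tokens combines with the always-correct monosemous tokens to give roughly the 45.0% overall figure quoted above.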
Publication date: 1994